Authors : Colin Campbell (c_c953), Leah Lewis (lrl68), Ryan Wakabayashi (rjw102) and Jake Worden (jrw294)
Abstract : Asteroid prediction is an ever-growing need with the increase in space travel and thoughts of space expansion. This project attempts to find the best performing models to predict the diameters of unknown asteroids as well as classify them as physically hazardous. After utilizing models for both Regression and Classification, the Random Forest Regressor and Decision Tree Classifier where shown to have the highest perfomance in accordance to our target success goals. Therefore, with some further work and tuning, these models could provide a potentially powerful means for identifying unknown small-bodies for planetary defense.
The ability to use the collected data on nearby asteroids and determine whether they present a threat to life on Earth is critical for the future of civilization. Companies like Nasa and SpaceX are currently working on technologies to identify these celestial threats and their trajectory. As the amount of data we can collect increases, it is imperative that we determine which attributes are key to identify how hazardous an asteroid is to Earth. The goal of this project is to use machine learning techniques to accurately predict an asteroids diameter and predict if an asteroid is physically hazardous to life on Earth based on features of the unknown asteroid.
Given a dataset of asteroid features, can machine learning be used to predict an unknown asteriod's diameter and determine whether it's physically hazardous or not in accordance to the following sucess measures?
| ML Approach | CV Score | F1 | Precision | Recall | R2 | MSE | MAPE |
|---|---|---|---|---|---|---|---|
| Predicting Diameter | >85% | - | - | - | >85% | ≤ 1.2 | ≤ 25 |
| Classifying Physically Hazardous Asteroid | ≥80% | ≥80% | ≥80% | ≥80% | - | - | - |
Each of the following kaggle projects attempted to apply machine learning to predict diameter of asteroids and served as a solid benchmark for this project:
| Kaggle Project Title | Project Author | Project Description | Project Results | Link to Project |
|---|---|---|---|---|
| Asteroid Diameter Estimators | Liam Toran | Supervised Regression Targeting Diameter | R2: 84.5% (XGBoost) | See Kaggle |
| Asteroid Diameter Prediction | TitanPointe | Evaluate MSE of ML Models for Asteroid Prediction | MSE: 12 (Random Forrest) | See Kaggle |
Links to the database and dataset used can be found below:
Small-Body DataBase Link: Jet Propulsion Laboratory Solar System Dynamics
Open Asteroid Dataset Link: Asteroid_Updated.csv
| Feature | Description | |
|---|---|---|
| a | Semi-major axis(au) | |
| e | Eccentricity | |
| i | Inclination with respect to x-y ecliptic plain(deg) | |
| om | Longitude of the ascending node | |
| w | Argument of perihelion | |
| q | Perihelion distance(au) | |
| ad | Aphelion distance(au) | |
| per_y | Oribital period(YEARS) | |
| data_arc | Data arc-span(d) | float64 |
| condition_code | Orbit condition code | |
| n_obs_used | Number of Observation used | |
| H | Absolute magnitude parameter | |
| neo | Near Earth Object | object |
| pha | Physically Hazardous Asteroid | |
| diameter | Diameter of asteroid(Km) | |
| extent | Object bi/tri axial ellipsoid dimensions(Km) | |
| albedo | Geometric albedo | |
| rot_per | Rotation Period(h) | |
| GM | Standard gravitational parameter, Product of mass and gravitational constant | |
| BV | Color index B-V magnitude difference | |
| UB | Color index U-B magnitude difference | |
| IR | Color index I-R magnitude difference | |
| spec_B | Spectral taxonomic type(SMASSII) | |
| spec_T | Spectral taxonomic type(Tholen) | |
| G | Magnitude slope parameter | |
| moid | Earth minimum orbit intersection distance(au) | |
| class | Asteroid orbit class | |
| n | Mean motion(deg/d) | |
| per | Orbital period(d) | |
| ma | Mean anomaly(deg) |
Importing all libraries for data gathering
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
Mounted at /content/drive
Read the csv file using pandas read_csv() method and print the first five entries
df = pd.read_csv("/content/drive/MyDrive/ML/Project/Asteroid_Updated.csv")
#df = pd.read_csv("../Asteroid_Updated.csv")
df.head()
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:2718: DtypeWarning: Columns (0,10,15,16,23,24) have mixed types.Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)
| name | a | e | i | om | w | q | ad | per_y | data_arc | condition_code | n_obs_used | H | neo | pha | diameter | extent | albedo | rot_per | GM | BV | UB | IR | spec_B | spec_T | G | moid | class | n | per | ma | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ceres | 2.769165 | 0.076009 | 10.594067 | 80.305532 | 73.597694 | 2.558684 | 2.979647 | 4.608202 | 8822.0 | 0 | 1002 | 3.34 | N | N | 939.4 | 964.4 x 964.2 x 891.8 | 0.0900 | 9.074170 | 62.6284 | 0.713 | 0.426 | NaN | C | G | 0.12 | 1.59478 | MBA | 0.213885 | 1683.145708 | 77.372096 |
| 1 | Pallas | 2.772466 | 0.230337 | 34.836234 | 173.080063 | 310.048857 | 2.133865 | 3.411067 | 4.616444 | 72318.0 | 0 | 8490 | 4.13 | N | N | 545 | 582x556x500 | 0.1010 | 7.813200 | 14.3000 | 0.635 | 0.284 | NaN | B | B | 0.11 | 1.23324 | MBA | 0.213503 | 1686.155999 | 59.699133 |
| 2 | Juno | 2.669150 | 0.256942 | 12.988919 | 169.852760 | 248.138626 | 1.983332 | 3.354967 | 4.360814 | 72684.0 | 0 | 7104 | 5.33 | N | N | 246.596 | NaN | 0.2140 | 7.210000 | NaN | 0.824 | 0.433 | NaN | Sk | S | 0.32 | 1.03454 | MBA | 0.226019 | 1592.787285 | 34.925016 |
| 3 | Vesta | 2.361418 | 0.088721 | 7.141771 | 103.810804 | 150.728541 | 2.151909 | 2.570926 | 3.628837 | 24288.0 | 0 | 9325 | 3.20 | N | N | 525.4 | 572.6 x 557.2 x 446.4 | 0.4228 | 5.342128 | 17.8000 | 0.782 | 0.492 | NaN | V | V | 0.32 | 1.13948 | MBA | 0.271609 | 1325.432765 | 95.861936 |
| 4 | Astraea | 2.574249 | 0.191095 | 5.366988 | 141.576605 | 358.687607 | 2.082324 | 3.066174 | 4.130323 | 63507.0 | 0 | 2916 | 6.85 | N | N | 106.699 | NaN | 0.2740 | 16.806000 | NaN | 0.826 | 0.411 | NaN | S | S | NaN | 1.09589 | MBA | 0.238632 | 1508.600458 | 282.366289 |
Use pandas shape method to identify the amount of data available
df.shape
(839714, 31)
Use pandas info method to identify the data types, null values and memory usage
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 839714 entries, 0 to 839713 Data columns (total 31 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 name 21967 non-null object 1 a 839712 non-null float64 2 e 839714 non-null float64 3 i 839714 non-null float64 4 om 839714 non-null float64 5 w 839714 non-null float64 6 q 839714 non-null float64 7 ad 839708 non-null float64 8 per_y 839713 non-null float64 9 data_arc 824240 non-null float64 10 condition_code 838847 non-null object 11 n_obs_used 839714 non-null int64 12 H 837025 non-null float64 13 neo 839708 non-null object 14 pha 823272 non-null object 15 diameter 137636 non-null object 16 extent 18 non-null object 17 albedo 136409 non-null float64 18 rot_per 18796 non-null float64 19 GM 14 non-null float64 20 BV 1021 non-null float64 21 UB 979 non-null float64 22 IR 1 non-null float64 23 spec_B 1666 non-null object 24 spec_T 980 non-null object 25 G 119 non-null float64 26 moid 823272 non-null float64 27 class 839714 non-null object 28 n 839712 non-null float64 29 per 839708 non-null float64 30 ma 839706 non-null float64 dtypes: float64(21), int64(1), object(9) memory usage: 198.6+ MB
From the data gathering, the following was obtained in relation to the initial inquries regarding the dataset:
Based on this information, the data will need to undergo some exstensive pre-processing prior to any exploratory data analysis .
Initial data gathering showed that the dataset is comprised of 839,714 data entries consisting of 31 variables made up of 3 different data types (float64, int64, object). Additionaly, it was determined that while some of the data entries' features contained no null values, it was shown that approximately 700,000 entries contained at least one null value if not many more. Since the goal of this project is to create both a regressor and a classifer for targeting diameter and physically harzardous asteriod respectively there is a need to address the frequency of null value occurences throughout the dataset.
Importing all libraries for data pre-processing, cleaning, labeling and maintenance
from sklearn.preprocessing import LabelEncoder as le
from sklearn.utils import resample
The results of the initial data gathering had shown the existence of a larger quantity of null values within the dataset. The pandas module can be used to print the sum of null values to determine which columns had a high percentage of null values. If either of the targets (diameter, pha) are present amoungst the columns, additional data cleaning will be needed prior to any exploratory data analysis. Otherwise any other column found to contain a large frequency of null values can be discarded from the dataframe as they will not be useful for either regression or classification.
Use pandas isnull and sum methods to print the sum of null values within the dataframe
print("The Sum of Null Values in The Dataframe\n" + "="*40 + "\n" + str(df.isnull().sum()))
The Sum of Null Values in The Dataframe ======================================== name 817747 a 2 e 0 i 0 om 0 w 0 q 0 ad 6 per_y 1 data_arc 15474 condition_code 867 n_obs_used 0 H 2689 neo 6 pha 16442 diameter 702078 extent 839696 albedo 703305 rot_per 820918 GM 839700 BV 838693 UB 838735 IR 839713 spec_B 838048 spec_T 838734 G 839595 moid 16442 class 0 n 2 per 6 ma 8 dtype: int64
As shown in the above cell output, 12 of the 31 features contained about 700,000 null values. Amongst these 12 features was diameter, the target for regression, with a total sum of 702,078 null values. The removal of these entries with null values for diameter would greatly diminish the size of the dataset, therefore the remaining 11 features can be dropped from the dataframe for now.
Use pandas drop method to drop the features from the dataframe
columns = ['name', 'extent', 'albedo', 'rot_per', 'GM', 'BV', 'G', 'UB', 'IR', 'spec_B', 'spec_T']
df.drop(columns=columns, inplace=True)
Now that the 11 features containing more than 700,000 null values have been dropped from the dataframe, pandas can used again to determine the remaining number of null values needing to processed prior to exploratory data analysis.
Use pandas isnull and sum methods to print the sum of null values within the dataframe
print("The Sum of Null Values in The Dataframe\n" + "="*40 + "\n" + str(df.isnull().sum()))
The Sum of Null Values in The Dataframe ======================================== a 2 e 0 i 0 om 0 w 0 q 0 ad 6 per_y 1 data_arc 15474 condition_code 867 n_obs_used 0 H 2689 neo 6 pha 16442 diameter 702078 moid 16442 class 0 n 2 per 6 ma 8 dtype: int64
✨Voila!✨ From the above cell it is shown that the dataframe contains significanty less null values and the remaining data is now constrained only by the null values present in the target features. Any further processing will need to address the specific data needs as it pertains to either classification or regression.
During data gathering, there were 9 categorical features observed to be of type object within the dataset. Of these 9 features, 5 were dropped during the initial cleaning and the remaining 4 features, pha, neo, condition_code and class contained only categorical data. In order for these remaining 4 features to be used by a machine learning algorithm they must first be converted from categorical to numerical data types. After converting these features, any further remaining data of incorrect type must also be converted to numeric values before attempting to drop the remaining null values from the dataframe.
Create seperate dataframe to be used for regression
df_regression = df
Currently the values belonging to both pha and neo features can either be 'Y' or 'N', while the values belonging to the condition_code feature range from 0 to 9, 'D', and 'E'. By using the map method provided by the pandas module, each value within the provided dictionary can easily be mapped to corresponding nurmerical values.
Use dictionaries along with pandas map method to transform categorical data to numerical data
df_regression['pha'] = df_regression['pha'].map({'Y': 1, 'N': 0})
df_regression['neo'] = df_regression['neo'].map({'Y': 1, 'N': 0})
df_regression['condition_code'] = df_regression['condition_code'].map({0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 'D': 11, 'E': 12})
Results from the above transformation now have pha and neo features containing either the value 0 or 1 and the new condition_code feature values ranging from 1 to 12.
Unlike the values belonging to the 3 features converted in the above cell, the values belonging to the class feature do not share any sort of sequential meaning or state with one another. Therefore, a label encoder can be used to map the feature values to arbitrary numerical representations.
Use Label Encoder to transform categorical data to numerical data
df_regression['class'] = le().fit_transform(df_regression['class'])
Now that each of the features initialy containing categorical objects have all been transformed to numerical data types, pandas can be used on the dataframe to change any non-convertable values to 'NaN' within each feature. By changing non-numeric values to be NaN, it will ensure that after the final drop of null values the dataframe will only contain numeric values.
Use pandas apply and to_numeric methods to change any non-convertable values to NaN
df_regression = df_regression.apply(lambda x: pd.to_numeric(x, errors='coerce'))
From here the dataframe should only consist of either numeric, NaN or null data types. Therefore the remaining features containing only null values will now be dropped from the dataframe. Additionally, any entries within the dataframe containing any null values will also be removed from the dataframe.
Use pandas dropna method to remove null values from the dataframe
df_regression.dropna(how='all', axis=1, inplace=True)
df_regression.dropna(how='any', axis=0, inplace=True)
To verify that the dataframe for regression is now cleaned, pandas can be used to show that there are currently no null values within the dataset and that all of values within the dataset are only of numeric types.
Use pandas isnull and sum methods to print the sum of null values within the dataframe
print("The Sum of Null Values in The Dataframe\n" + "="*40 + "\n" + str(df_regression.isnull().sum()))
The Sum of Null Values in The Dataframe ======================================== a 0 e 0 i 0 om 0 w 0 q 0 ad 0 per_y 0 data_arc 0 condition_code 0 n_obs_used 0 H 0 neo 0 pha 0 diameter 0 moid 0 class 0 n 0 per 0 ma 0 dtype: int64
Use pandas dtypes method to show the catalog of data types in the dataframe
df_regression.dtypes
a float64 e float64 i float64 om float64 w float64 q float64 ad float64 per_y float64 data_arc float64 condition_code float64 n_obs_used int64 H float64 neo float64 pha float64 diameter float64 moid float64 class int32 n float64 per float64 ma float64 dtype: object
The dataset for regression is now cleaned and ready for some exploratory data analysis! But before taking a deep dive into it, it may be useful to see the final size of the dataset that's going to be used.
Use pandas shape method to identify the dimensions of the dataframe
df_regression.shape
(127910, 20)
✨Wowza!✨ Following the data cleaning for regression, the dataset went from containing 839,714 entries with 31 features to only containing 127,910 entries with just 20 features! Now that the dataset for regression is finalized, exploratory data analysis will help determine which of these 20 features, if any, can be used to model asteroid diameter.
One major hurdle to overcome when pre-processing data for classification is the posible existence of class imbalances. Class imbalances refer to higher instances of one particular class that will lead to biases in the classifiers. These biases toward the majority class can then result in bad classifications of the minority class. For example, if the originally the dataset had a distribution of 499,000 instances of class 0 and 1000 instances of class 1, then the classification scores will be based upon how accurate the classifer is at predicting instances being of class 0. Therefore, before handling any non-numeric and null datatypes in the datasets it would be best to address any significant class imbalances present within the dataframe.
Create seperate dataframe to be used for classification
df_classification = df
Prior to handling any non-numeric datatypes and null values in the datasets, pandas can be used to determine whether the the classifer's target feature, pha, has a significant class imbalance.
Use pandas groupby method to output the class size distribution
class_counts = df_classification.groupby('pha').size()
class_counts
pha N 821257 Y 2015 dtype: int64
The results from the above cell shows that the pha feature's 0 class has 821,257 instances, while the 1 class has only 2015. This significant disproportion, if not adjusted, will lead to a high bias towards the majority class, 0, and give false high accuracy scores. To solve this problem, a Variational Autoencoder (VAE) will be needed to generate 6000 extra samples for the minority class.
After using the VAE.ipynb included within this repository, new data was appended to the original csv and saved as a new csv.
Read the new csv file created by the VAE using pandas read_csv() method and print the first five entries
df_classification = pd.read_csv("/content/drive/MyDrive/ML/Project/Asteroid_VAE_data.csv")
#df_classification = pd.read_csv("../Asteroid_VAE_data.csv")
df.head()
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:2718: DtypeWarning: Columns (9,17,19,20,27,28) have mixed types.Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)
| name | a | e | i | om | w | q | ad | per_y | data_arc | condition_code | n_obs_used | H | neo | pha | diameter | extent | albedo | rot_per | GM | BV | UB | IR | spec_B | spec_T | G | moid | class | n | per | ma | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ceres | 2.769165 | 0.076009 | 10.594067 | 80.305532 | 73.597694 | 2.558684 | 2.979647 | 4.608202 | 8822.0 | 0 | 1002 | 3.34 | N | N | 939.4 | 964.4 x 964.2 x 891.8 | 0.0900 | 9.074170 | 62.6284 | 0.713 | 0.426 | NaN | C | G | 0.12 | 1.59478 | MBA | 0.213885 | 1683.145708 | 77.372096 |
| 1 | Pallas | 2.772466 | 0.230337 | 34.836234 | 173.080063 | 310.048857 | 2.133865 | 3.411067 | 4.616444 | 72318.0 | 0 | 8490 | 4.13 | N | N | 545 | 582x556x500 | 0.1010 | 7.813200 | 14.3000 | 0.635 | 0.284 | NaN | B | B | 0.11 | 1.23324 | MBA | 0.213503 | 1686.155999 | 59.699133 |
| 2 | Juno | 2.669150 | 0.256942 | 12.988919 | 169.852760 | 248.138626 | 1.983332 | 3.354967 | 4.360814 | 72684.0 | 0 | 7104 | 5.33 | N | N | 246.596 | NaN | 0.2140 | 7.210000 | NaN | 0.824 | 0.433 | NaN | Sk | S | 0.32 | 1.03454 | MBA | 0.226019 | 1592.787285 | 34.925016 |
| 3 | Vesta | 2.361418 | 0.088721 | 7.141771 | 103.810804 | 150.728541 | 2.151909 | 2.570926 | 3.628837 | 24288.0 | 0 | 9325 | 3.20 | N | N | 525.4 | 572.6 x 557.2 x 446.4 | 0.4228 | 5.342128 | 17.8000 | 0.782 | 0.492 | NaN | V | V | 0.32 | 1.13948 | MBA | 0.271609 | 1325.432765 | 95.861936 |
| 4 | Astraea | 2.574249 | 0.191095 | 5.366988 | 141.576605 | 358.687607 | 2.082324 | 3.066174 | 4.130323 | 63507.0 | 0 | 2916 | 6.85 | N | N | 106.699 | NaN | 0.2740 | 16.806000 | NaN | 0.826 | 0.411 | NaN | S | S | NaN | 1.09589 | MBA | 0.238632 | 1508.600458 | 282.366289 |
After reading the new csv generated by the VAE, pandas can be used to recheck the the class size distrubition of pha to verify the exisitence of new samples belonging to the minority class.
Use pandas groupby method to output the class size distribution
class_counts = df_classification.groupby('pha').size()
class_counts
pha 1.0 6000 N 821257 Y 2015 dtype: int64
The new class distribution for pha shows the addition of a new '1.0' class comprised of the 6000 values generated by the VAE. This discrepancy can be resolved by using the same approach taken when cleaning the dataframe for regression. Except this time the pha feature value'1.0' will also be mapped to '1' along with the 'Y' values. Additionally, the mapping of the other categorical datatypes to numerical datatypes, as seen when cleaning the regression dataframe, can also be completed at this time.
Use dictionaries along with pandas map method to transform categorical data to numerical data
df_classification['pha'] = df_classification['pha'].map({'1.0' : 1, 'Y': 1, 'N': 0})
df_classification['neo'] = df_classification['neo'].map({'1.0' : 1, 'Y': 1, 'N': 0})
df_classification['condition_code'] = df_classification['condition_code'].map({'0': 1, '1': 2, '2': 3, '3': 4, '4': 5, 5:6, '5': 6, '6': 7, '7': 8, '8': 9, '9': 10, 'D': 11, 'E': 12, -0.0:1, -1.0:1, 2.0:2, 3.0:3, 4.0:4, 5.0:5, 6.0:6, 7.0:7, 8.0:8, 9.0:9})
df_classification['condition_code'] = df_classification['condition_code'].apply(lambda x: pd.to_numeric(x, errors='coerce'))
Use Label Encoder to transform categorical data to numerical data
df_classification['class'] = le().fit_transform(df_classification['class'])
This portion of code is to visualize what we originally had as our classification data. We noticed that we were being left with a very small amount of samples for class 1 so we tried dropping other columns with higher null values to see if they were the cause of the drops.
df_with_cols = df_classification.dropna(how='all', axis=1)
df_with_cols = df_classification.dropna(how='any', axis=0)
class_counts = df_with_cols.groupby('pha').size()
class_counts.shape
(0,)
By dropping none of the features with high null values we are left with an empty dataset, therefore we need to drop at least those with majority nulls like we did in the regression section.
columns_test = ['Unnamed: 0', 'name', 'extent', 'albedo', 'rot_per', 'GM', 'BV', 'UB', 'IR', 'spec_B', 'spec_T', 'G']
df_test = df_classification.drop(columns=columns_test)
df_test = df_test.dropna(how='all', axis=1)
df_test = df_test.dropna(how='any', axis=0)
class_counts = df_test.groupby('pha').size()
class_counts
pha 0.0 135626 1.0 207 dtype: int64
In the above output there are only 207 samples for class 1. By dropping the featres diameter, w, per, and ma we saw it better maintain the data's integrity and only drop a few samples from class 1.
Use pandas drop method to drop the features from the dataframe
columns = ['Unnamed: 0', 'name', 'extent', 'albedo', 'rot_per', 'GM', 'BV', 'UB', 'IR', 'spec_B', 'spec_T', 'G', 'condition_code', 'diameter', 'w', 'per', 'ma']
df_classification = df_classification.drop(columns=columns)
Now that all unnecessary columns have been dropped from the dataframe and all incorrect datatypes have been corrected, the remaining features containing only null values can be dropped from the dataframe as well as any entries containing any null values.
Use pandas dropna method to remove null values from the dataframe
df_classification.dropna(how='all', axis=1, inplace=True)
df_classification.dropna(how='any', axis=0, inplace=True)
To verify that the dataframe for classification is now cleaned, pandas can be used to show that there are currently no null values within the dataset and that all of values within the dataset are only of numeric types.
Use pandas isnull and sum methods to print the sum of null values within the dataframe
print("The Sum of Null Values in The Dataframe\n" + "="*40 + "\n" + str(df_classification.isnull().sum()))
The Sum of Null Values in The Dataframe ======================================== a 0 e 0 i 0 om 0 q 0 ad 0 per_y 0 data_arc 0 n_obs_used 0 H 0 neo 0 pha 0 moid 0 class 0 n 0 dtype: int64
Use pandas dtypes method to show the catalog of data types in the dataframe
df_classification.dtypes
a float64 e float64 i float64 om float64 q float64 ad float64 per_y float64 data_arc float64 n_obs_used float64 H float64 neo float64 pha float64 moid float64 class int32 n float64 dtype: object
The classification data frame now contains 0 null values throughout all 15 features and all features have numeric data types. Although this was a sufficiant stopping point for regression, following the creation of the additional 6000 minority class values for pha, there still exists a significant class imbalance needing to be addressed. The pandas module can again be used to display the current class distribution for pha within the dataframe for classification following the dropping of null values.
Use pandas groupby method to output the class size distribution
class_counts = df_classification.groupby('pha').size()
class_counts
pha 0.0 818167 1.0 8013 dtype: int64
As seen in the above cell, the pha feature's class distribution following the data cleaning is as follows - 818,167 instances of class 0 and 8013 of class 1. At this point in time, the resample method provided by the sklearn utility module can be used to downsample the majority class '0.0' frequency to that of the minority class '1.0'. This will remove the large bias encountered by our original, imbalanced dataset and give more accurate results.
Use sklearn resample method to downsample the majority class to match the minority class
# Separate majority and minority classes
df_majority = df_classification[df_classification['pha'].iloc[:,]==0]
df_minority = df_classification[df_classification['pha'].iloc[:,]==1]
# Downsample majority class
df_majority_downsampled = resample(df_majority, replace=False, n_samples=8013)
df_classification = pd.concat([df_majority_downsampled, df_minority])
Following the downsampling of the majority class, use pandas to see the resulting class distribution.
Print the counts of each classification for the feature pha
class_counts = df_classification.groupby('pha').size()
class_counts
pha 0.0 8013 1.0 8013 dtype: int64
Finally the pha classes are balanced and we can continue to EDA for both datasets.
Use pandas shape method to identify the dimensions of the dataframe
df_classification.shape
(16026, 15)
Following the data cleaning and class balancing needed for a classifer, the dataset went from containing 839,714 entries with 31 features to only containing 16,026 entries with just 15 features!😲 While only having a dataset consiting of 16,026 entries is not ideal, it ensure that any classification model made will contain no bias towards a particular class. And now that the dataset for classification is finalized, exploratory data analysis can help determine which of these 15 features, if any, can be used to determine whether a unknown asteroid could be considered physically hazardous.
Now that the dataframes for both regression and classification have been cleaned, in order to identify features with strong relationships and correlation values in their respective datasets, visualizations and feature selection tools can be used to provide both graphical and mathematical insight into the identification of which features, if any, have the strongest relationship to the targets. Seaborn pairplot helps display the relationship between features and the target values. A Correlation heatmap can visualize the strength of correlation, positive or negative, between features and targets. The correlation value can range from -1 to 1. Analysis of Variance (ANOVA) is a popular tool for statistics. ANOVA helps find out whether the differences between groups of data are statistically significant. The F-score calculation within ANOVA determines the ratio of explained variance to unexplained variance.
Both the regression and classification dataframes will seperately undergoe the same exploratory data anaylsis process.
Import all libraries for exploratory data analysis
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest
The first method that will be visualized is a seaborn pairplot of the regression dataframe. This will visualize the features correlation in a graphical format so that patterns may be recognized.
sns.set(style="dark", color_codes=True)
g = sns.pairplot(data=df_regression)
plt.show()
From the pair plot above, some of the features can be selected for modeling diameter prediction. The pair plot allows for the important features to be selected for prior to modeling. As seen above, features like Orbital Period (per_y), Absolute Magnitude Parameter (H), Perihelion distance (q), and Mean motion (n) should be selected as parameters of diameter prediction because a pattern can be visually seen.
Next a heatmap on the classification data frame will be created. This heatmap will focus on correlation of the Diameter (diameter) feature.
plt.subplots(figsize = (16,15))
sns.heatmap(df_regression.corr(),annot=True, annot_kws={'size':10})
<AxesSubplot:>
We will select the highest correlation values with respect to diameter in the heatmap generated from the regression data frame for modeling.
Features with correlation greater of at least |0.5|:
Unfortunately, these correlation values are not very strong, meaning that diameter prediction will be difficult to achieve while using the regression dataset. Because of this, maintaining as much data as possible will be our main goal for this dataset. Hopefully with more data, the lower correlations can still be of use in predictions.
Working with large data sets and multiple models is expensive in processing power. Reducing dimensions by doing feature selection can cut down on this processing overhead. The seaborn pairplot and heatmap methods have allowed for a handful of features to be selected from the regression data set. These features can now be used for modeling, but it is helpful to check what features other methods may select. In order to ensure no useful features are left out ANOVA will be used. The scikit-learn statistical tool ANOVA, or SelectKBest specifically, can be used to automatically score how well a feature correlates to the target.
Set the target feature 'diameter' to y_regression and the other remaining features to x_regression
x_regression = df_regression.drop(columns=['diameter'])
y_regression = df_regression['diameter']
Now that x_regression and y_regression data sets contain values, an ANOVA can be performed using SelectKBest. The fit_transform method will transform x_regression to only contain our 10 best features from the dataset.
Use the SelectKBest function to create the anova then use the fit_transform method to acquire the ANOVA scores. Print the result
# ANOVA on features on target to determine which features are significant
anova = SelectKBest(k=10)
# fitting ANOVA model with features and target
bestX = anova.fit_transform(x_regression, y_regression)
print("Regression ANOVA ", x_regression.shape)
for i in range(len(x_regression.columns)):
print(f'{x_regression.columns[i]}: {anova.scores_[i]}')
Regression ANOVA (127910, 19) a: 31.637196697611962 e: 1.7842577610228874 i: 1.6621992239947645 om: 1.0029678571803002 w: 0.9957726525102555 q: 11.095961249580835 ad: 32.88636968270681 per_y: 40.74103699750825 data_arc: 8.099009943513074 condition_code: 0.5430670263398989 n_obs_used: 4.7797993210753 H: 17.841635817082636 neo: 10.855769669869455 pha: 12.688456154927248 moid: 11.167513458143228 class: 9.089789442546143 n: 9.20323399210887 per: 40.74103699750813 ma: 1.0542990472843297
Higher values are desired for feature selection from the above ANOVA.
Need more through results explanation.
Based on the ANOVA, heatmap, and pairplot we can see that there are multiple features that have high importance when determining diameter. The final 10 features that will be used for the regression models are listed in the following table from highest to lowest ANOVA score (ANOVA values may vary. Listed values are from a sample execution).
| Feature | ANOVA score |
|---|---|
| per_y | 40.74103699750825 |
| per | 40.74103699750813 |
| ad | 32.88636968270681 |
| a | 31.637196697611962 |
| H | 17.841635817082636 |
| q | 11.095961249580835 |
| moid | 11.167513458143228 |
| neo | 10.855769669869455 |
| n | 9.20323399210887 |
| class | 9.089789442546143 |
These features listed above were chosen based on the visualizations and confirmed with the ANOVA. The visualizations showed us how all of the features correlated to eachother and was very large and slightly hard to read. With our ANOVA values, we can automatically maintain the features with the largest impacts to our target variable, diameter, which should lead to better predicitons.
The first method that will be visualized is a seaborn pairplot of the classification dataframe. This will visualize any relationships between features in a graphical format so that patterns may be easily recognized.
sns.set(style="dark", color_codes=True)
g = sns.pairplot(data=df_classification)
plt.show()
The physically hazardous asteroid feature is a binary 'Y' or 'N', thus a relationship between feature and target cannot be distinguished. This makes utilization of a heatmap much more important for visualization.
Next a heatmap on the classification data frame will be created. This heatmap will focus on correlation of the Physically Hazardous Asteroid pha feature.
plt.subplots(figsize = (16,15))
sns.heatmap(df_classification.corr(),annot=True, annot_kws={'size':10})
<AxesSubplot:>
From the heatmap above, multiple features can be visually identified to have a strong correlation with the pha feature.
Features with correlation greater than |0.7|:
The features listed above may prove to be useful when attempting to predict pha.
Working with large data sets and multiple models is expensive in processing power. Reducing dimensions by doing feature selection can cut down on this processing overhead. The seaborn heatmap has allowed for a handful of features to be selected from the classification data set. These features can now be used for modeling, but because the pairplot did not provide any insight into feature relationships, ANOVA will be used like it was with the regression data set.
Set the target feature 'pha' to y_classification and the other remaining features to x_classification
x_classification = df_classification.drop(columns=['pha'])
y_classification = df_classification['pha']
Now that x_classification and y_classification data sets contain values, an anova can be performed using the sklearn SelectKBest feature. The fit method compares x_classification and y_classification, this allows for a final dataset containing resultant values to be created. We will not use fit_transform for this dataset because we want to visualize how impactful the features are and do feature selection separately.
Use the SelectKBest function to create the anova then use the fit method to acquire the anova values. Print the result
print("Classification ANOVA")
# ANOVA on features on target to determine which features are significant
anova = SelectKBest(k=10)
# fitting ANOVA model with features and target
anova.fit(x_classification, y_classification)
for i in range(len(x_classification.columns)):
print(f'{x_classification.columns[i]}: {anova.scores_[i]}')
Classification ANOVA a: 973.1923712442439 e: 40724.4373212496 i: 2089.3272181553857 om: 177.318655590355 q: 3442.90119996731 ad: 161.2718440409309 per_y: 35.739245711537 data_arc: 0.02225635815843119 n_obs_used: 224.5891251242662 H: 14389.652109235822 neo: 333478.19148936146 moid: 2585.0191277389204 class: 211069.42054369886 n: 11441.341732571831
Higher values are desired for feature selection from the above ANOVA.
Based on the ANOVA, heatmap, and pairplot we can see that there are multiple features that have high importance when classifying pha. The final 10 features that will be used for the regression models are listed in the following table from highest to lowest ANOVA score (ANOVA values may vary. Listed values are from a sample execution).
| Feature | ANOVA score |
|---|---|
| neo | 339015.87 |
| class | 213978.95 |
| e | 40291.96 |
| H | 14018.56 |
| n | 11261.44 |
| q | 3635.74 |
| moid | 2722.93 |
| i | 2339.11 |
| om | 234.89 |
| n_obs_used | 153.41 |
These features were chosen for from visualization, and confirmed with the ANOVA. Because ANOVA gives a concrete numeric value, its results are valued more than visualization.
The key features have been selected using both the visualization and ANOVA methods. Now that these features have been identified, machine learning approaches can be used to model the data. The 10 selected features are displayed in the following table for regression and classification. Several algorithms will be tested on both the classification and regression data sets, the goal is to find out which algorithms provide the most desirable results and why they work.
| No. | Regression | Classification |
|---|---|---|
| 1 | per_y | neo |
| 2 | per | class |
| 3 | ad | e |
| 4 | a | H |
| 5 | H | n |
| 6 | q | q |
| 7 | moid | moid |
| 8 | neo | i |
| 9 | n | om |
| 10 | class | n_obs_used |
We tried multiple models for our regression prediction. When it came to parameter tuning, some took an excessive amount of resources and we chose to look elsewhere. If a model performed badly after gridsearch and 10-fold cross validation, we looked into more data and other methods of improving but inevitably found other models that performed well with less tuning and less computational cost.
All attempted models
| Algorithm | Supervised? | Regression or Classification? | Reasoning |
|---|---|---|---|
| Random Forest | Supervised | Regression | High performing model, well known |
| K-Nearest Neighbor | Unsupervised | Regression | Simple data-driven model |
| Stochastic Gradient Descent | Supervised | Regression | Good for large amounts of data |
| Gradient Boosting Regressor | Supervised | Regression | High performing model |
| Lasso | Supervised | Regression | Feature selection |
| Support Vector Machine | Supervised | Regression | Can model both linear and non-linear relationships between variables |
| Logistic Regression | Supervised | Classification | Base model for predicting probability of target |
| Support Vector Machine | Supervised | Classification | Works well for low-dimensional data |
| Decision Tree | Supervised | Classification | Easy to compute and explain implementation |
Diameter is a continuous target and thus requires a regression predictive model. We decided to use K-Nearest Neighbors, Epsilon-Support Vector, Gradient Boosting, and Random Forest algorithms to predict the diameter. We followed the following steps to ensure the best results were produced:
Importing all libraries for regression models
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler as ss
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error as MSE, mean_absolute_percentage_error as mape
import matplotlib.pyplot as plt
Spilt the data into training and testing sets using test_size = 0.2 and random_state = 1. This ensures that 80% of the dataset will be used for the training set, while the remaining 20% will be used the testing set.
Use sklearn train_test_split method to split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(bestX , y_regression, test_size = 0.2, random_state=1)
KNN is a simple data-driven model. KNN is a lazy algorithm that requires no learning, but has proven to be useful in real-world predictions by approximating the association of a continuous target based on a provided data.
Use sklearn KNeighborsRegressor fit method to train a KNN model
knn = KNeighborsRegressor()
knn.fit(x_train, y_train)
print("Base KNN score:", knn.score(x_test,y_test))
Base KNN score: 0.6613847267130195
The KNN score from above is the coefficient of determination R2 and ranges from 0 to 1.0. For KNN regressor with default parameters, the score was only .66 which makes this current model unreliable and does not meet our goal measures. We now look for ways to improve upon this base model by hyper-parameter tuning with sklearn model selection GridSearchCV algorithm.
Use sklearn GridSearchCV fit method for paramater tuning
"""
param_grid = {'n_neighbors' : [3],
'weights' : ['distance'],
'metric' : ['chebyshev','euclidean', ]
}
gs = GridSearchCV(estimator=knn, param_grid=param_grid, scoring='r2', cv=10)
gs = gs.fit(x_train, y_train)
print(gs.best_params_)
print(gs.best_score_)
"""
"\nparam_grid = {'n_neighbors' : [3],\n 'weights' : ['distance'],\n 'metric' : ['chebyshev','euclidean', ]\n }\n\ngs = GridSearchCV(estimator=knn, param_grid=param_grid, scoring='r2', cv=10)\ngs = gs.fit(x_train, y_train)\nprint(gs.best_params_)\nprint(gs.best_score_)\n"
After running the grid search with the parameter grid above, the following parameters were selected as the best performing
The KNN model will now use the best performing parameters from the grid search to make an optimal model and print its results.
Use sklearn KNeighborsRegressor fit method to train a KNN model
bestKNN = KNeighborsRegressor(n_neighbors=3, weights='distance', metric='euclidean')
bestKNN.fit(x_train, y_train)
pred_knn_opt = bestKNN.predict(x_test)
print('Optimal KNN: ', bestKNN.score(x_test,y_test))
print('Optimal MSE: ', MSE(pred_knn_opt,y_test))
print("Optimal MAPE: ", mape(y_test, pred_knn_opt))
Optimal KNN: 0.6367102384591373 Optimal MSE: 35.48191939835578 Optimal MAPE: 0.2567843715206486
This optimal KNN model still does not meet our goals for regression. 63.6% R-squared means our data is not fitting well with this model.
K-fold cross-validation is a powerful statistical method that estimates the skill of a model against unseen data. This method of cross-validation shuffles the data, splits it into k groups, removes one group, trains the model, and tests the model against the removed group.
Use sklearn cross_val_score method to validate the model with k-folds
cross_val_KNN = cross_val_score(bestKNN, x_train, y_train, cv=10, n_jobs=16)
print("Optimal KNN CV mean score: ", cross_val_KNN.mean())
Optimal KNN CV mean score: 0.6346900496828048
The optimal KNN cross-validation score along with the other measures shows that KNN regressor is not a reliable prediction model for this dataset.
Use Matplotlib scatter method to create a scatter plot of Actual vs Predicted Diameter
color = 1 - (y_test / pred_knn_opt)
fig, ax = plt.subplots()
plt.scatter(y_test, pred_knn_opt, c=color, cmap='plasma', vmin=-1, vmax=1, alpha=0.5)
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
ax.set_xlabel("Actual Diameter")
ax.set_ylabel("Predicted Diameter")
plt.title("Actual vs Predicted Asteroid Diameter using KNN")
Text(0.5, 1.0, 'Actual vs Predicted Asteroid Diameter using KNN')
Comparing the scatter plot above to the linear line, which shows a 100% R-squared, we see that this model tends to under-predict diameters between 100 and 300. There are also a couple outliers far above that were over-predicted. Overall, this is not the model we want to choose to predict diameter.
Support Vector Regressor has the ability to model both linear and non-linear relationships between variables.
The selected feature data must be scaled to avoid features with larger ranges from dominating the other features.
z = (x - u) / s where x is the sample, u is the mean, and s is the standard deviation of a feature
Use StandardScaler fit_transform method to scale the data
sc = ss()
x_train_std = sc.fit_transform(x_train)
x_test_std = sc.fit_transform(x_test)
The selected feature data is now scaled into x_train_std and x_test_std and is ready to be used.
Use sklearn SVR fit method to train a SVR model
svr=SVR()
svr.fit(x_train_std, y_train)
pred_svr = svr.predict(x_test_std)
print("Base SVR score: ", svr.score(x_test_std, y_test))
print('Base MSE: ', MSE(pred_svr,y_test))
Base SVR score: 0.6559580484997284 Base MSE: 33.60201713643008
The SVR model with base parameters only scored .66. This score does not meet our goal measure, thus we move to hyper-parameter tuning with GridSearchCV.
Use sklearn GridSearchCV fit method for paramater tuning
"""
param_grid = [{'C' : [.001, .01, .1, 1, 10],
'epsilon' : [.001, .01, .1, 1, 10]}]
gs = GridSearchCV(estimator=svr, param_grid=param_grid, scoring='r2', cv=10, return_train_score=True, verbose=True, n_jobs=16)
svr.fit(x_train, ytrain)
gs = gs.fit(x_train, ytrain)
print(gs.best_params_)
print(gs.best_score_)
"""
"\nparam_grid = [{'C' : [.001, .01, .1, 1, 10],\n 'epsilon' : [.001, .01, .1, 1, 10]}]\ngs = GridSearchCV(estimator=svr, param_grid=param_grid, scoring='r2', cv=10, return_train_score=True, verbose=True, n_jobs=16)\nsvr.fit(x_train, ytrain)\ngs = gs.fit(x_train, ytrain)\nprint(gs.best_params_)\nprint(gs.best_score_)\n"
Parameter tuning for SVR will be accomplished with gridsearchCV on parameters C and epsilon ranging from 0.001 to 100 by multiples of 10.
SVR was selected after attempting Stochastic Gradient Descent. After over 72 hrs of parameter tuning, approximately .40 was the highest R2 score achieved. SGD Regression on this dataset required the max_iter to be changed from the default of 1,000 to 1,000,000 to ensure convergence occured. Parameters epsilon and eta0 made almost no impact on the SGD Regression score, but a slight difference in the alpha caused the score to jump from approximately .40 into an unrealistically large integer.
Grid search returned the best performing parameters:
The grid search will output the best performing parameters and we will use them to make an optimal SVR model and print the results.
Use sklearn SVR fit method to train a SVR model
svr=SVR(C=10, epsilon=1)
svr.fit(x_train_std, y_train)
pred_svr_opt = svr.predict(x_test_std)
print("Optimal SVR score: ", svr.score(x_test_std,y_test))
print('Optimal MSE: ', MSE(pred_svr_opt,y_test))
print("Optimal MAPE: ", mape(y_test, pred_svr_opt))
Optimal SVR score: 0.7644326792110565 Optimal MSE: 23.007476603988362 Optimal MAPE: 0.2608862305681349
The optimal SVR score falls short of the set R2 goal. A score of 76.4% indicates that the model is not reliable enough to make accurate predictions.
We now use cross-validation to ensure consistency with our previous R2 score.
Use sklearn cross_val_score method to validate the model with k-folds
cross_val_svr = cross_val_score(svr, x_train_std, y_train, cv=10, n_jobs=16)
print("Optimal SVR CV mean score: ", cross_val_svr.mean())
Optimal SVR CV mean score: 0.7611819114094261
The mean cross-validation score was consistent with the previous SVR score at 76%. This supports that SVR is not a good model for this data.
Use Matplot scatter method to create a scatter plot graph
color = 1 - (y_test / pred_svr_opt)
fig, ax = plt.subplots()
plt.scatter(y_test, pred_svr_opt, c=color, cmap='plasma', vmin=-1, vmax=1, alpha=0.5)
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
ax.set_xlabel("Actual Diameter")
ax.set_ylabel("Predicted Diameter")
plt.title("Actual vs Predicted Asteroid Diameter using SVR")
Text(0.5, 1.0, 'Actual vs Predicted Asteroid Diameter using SVR')
Gradient Boosting Regressor (GBR), a boosted decision tree algorithm, was selected in hopes that it's slower learning rate will yield a higher accuracy when attempting to predict the target. The GBR model consists of a combination of many weak learners in order to achieve a single strong learner. The two hyperparemeters for GBR algorithms are n_estimators and learning_rate where n_estimators represents the number of trees added to the overall model and learning_rate represents the rate in which the model learns. When setting these parameters it is best to keep in mind that although performance can be expected to increase when using a slower learning rate, more trees will be needed in order to train the model. Thus, to prevent overfitting, a compromise must be made when choosing these hyperparameters.
To understand the model well, we first run a base GBR model using it's default parameters values( n_estimators=100, learning_rate=0.1).
Use sklearn GradientBoostingRegressor fit method to train a GBR model
gbr = GradientBoostingRegressor(random_state=0)
gbr.fit(x_train, y_train)
print("Base GBR score: ",gbr.score(x_test, y_test))
Base GBR score: 0.8860433879474203
The base GBR score of 88.6% is much higher than the goal we established. This model is reliable and can be used to make predictions. We now look to see if we can raise the GBR score by hyper-parameter tuning with GridSearchCV.
Use sklearn GridSearchCV fit method for paramater tuning
"""
param_grid = {'n_estimators' : [105],
}
gbr = GradientBoostingRegressor()
gs = GridSearchCV(estimator=gbr, param_grid=param_grid, scoring='r2', cv=10)
gs = gs.fit(x_train, y_train)
print(gs.best_params_)
print(gs.best_score_)
"""
"\nparam_grid = {'n_estimators' : [105],\n }\ngbr = GradientBoostingRegressor()\ngs = GridSearchCV(estimator=gbr, param_grid=param_grid, scoring='r2', cv=10)\ngs = gs.fit(x_train, y_train)\nprint(gs.best_params_)\nprint(gs.best_score_)\n"
After running the grid search with the parameter grid above, the following parameters were selected as the best performing
We can now use the provided parameters, n_estimators=105, to create an optimally tuned GBR model.
Use sklearn GradientBoostingRegressor fit method to train a GBR model
gbr_opt = GradientBoostingRegressor(n_estimators=105, random_state=0)
gbr_opt.fit(x_train, y_train)
pred_gbr_opt = gbr_opt.predict(x_test)
print("Optimal GBR score: ",gbr_opt.score(x_test, y_test))
print('Optimal MSE: ', MSE(pred_gbr_opt,y_test))
print("Optimal MAPE: ", mape(y_test, pred_gbr_opt))
Optimal GBR score: 0.8855842001165629 Optimal MSE: 11.174804850386314 Optimal MAPE: 0.23529300161326797
The optimal GBR score surpasses the set R2 goal. A score of 88.6% indicates that the model may be considered reliable enough to make accurate predictions.
We now use cross-validation to ensure consistency with our previous R2 score.
Use sklearn cross_val_score method to validate the model with k-folds
cross_val_GBR = cross_val_score(gbr_opt, x_train, y_train, cv=10, n_jobs=16)
print("Optimal GBR CV mean score: ", cross_val_GBR.mean())
Optimal GBR CV mean score: 0.8888411931245235
Our optimal Random Forest averages around 87.7% in the cross validation meaning we are not overfit and due to it being close to our other score we can say that this model does perform well for this dataset.
Use Matplot scatter method to create a scatter plot graph
color = 1 - (y_test / pred_gbr_opt)
fig, ax = plt.subplots()
plt.scatter(y_test, pred_gbr_opt, c=color, cmap='plasma', vmin=-1, vmax=1, alpha=0.5)
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
ax.set_xlabel("Actual Diameter")
ax.set_ylabel("Predicted Diameter")
plt.title("Actual vs Predicted Asteroid Diameter using GBR")
Text(0.5, 1.0, 'Actual vs Predicted Asteroid Diameter using GBR')
The next model chosen was Random Forest. This model is known for being high-performing but can be easily overfit so keeping an eye on this model is a must. The model is imported with sci-kit learn's ensemble library. To understand the model well, we first run a base RF model to see how it does with no parameters on the data.
Use sklearn's RandomForestRegressor fit method to train a RF model
from sklearn.ensemble import RandomForestRegressor
rf_base = RandomForestRegressor()
rf_base.fit(x_train, y_train)
pred_rf_base = rf_base.predict(x_test)
print("Base RF score: ",rf_base.score(x_test,y_test))
print('Base MSE: ', MSE(pred_rf_base,y_test))
print("Base MAPE: ", mape(y_test, pred_rf_base))
Base RF score: 0.8829676869404622 Base MSE: 11.430355431347833
Wow! With no parameters the Random Forest Regressor gave a score within our goals! The MSE is higher than we would want but is still relatively low. We will try running a parameter grid search and see if we can increase the score.
Use sklearn GridSearchCV fit method for paramater tuning
"""
param_grid = [{'n_estimators' : [100, 150, 200, 250, 300],
'max_depth' : [None, 10, 20, 30, 40],
'min_samples_split' : [2, 3, 4]}]
gs = GridSearchCV(estimator=rf, param_grid=param_grid, scoring='r2', cv=10, return_train_score=True)
gs.fit(xtrain,ytrain)
print("\nBest parameters: ",gs.best_params_)
print(gs.best_score_)
"""
'\nparam_grid = [{\'n_estimators\' : [100, 150, 200, 250, 300],\n \'max_depth\' : [None, 10, 20, 30, 40],\n \'min_samples_split\' : [2, 3, 4]}]\n\ngs = GridSearchCV(estimator=rf, param_grid=param_grid, scoring=\'r2\', cv=10, return_train_score=True)\ngs.fit(xtrain,ytrain)\nprint("\nBest parameters: ",gs.best_params_)\nprint(gs.best_score_)\n'
After running the grid search with the parameter grid above, the following parameters were selected as the best performing
The best score for the grid search wasn't much higher than our original score but we will use the parameters as our optimal model to see the output and compare to our base.
Use sklearn RandomForestRegressor fit method to train a RF model
rf_opt = RandomForestRegressor(n_estimators=100, max_depth=20, min_samples_split=3)
rf_opt.fit(x_train, y_train)
pred_rf_opt = rf_opt.predict(x_test)
print("Optimal RF score: ", rf_opt.score(x_test,y_test))
print('Optimal MSE: ', MSE(pred_rf_opt,y_test))
print("Optimal MAPE: ", mape(y_test, pred_rf_opt))
Optimal RF score: 0.8896827196323669 Optimal MSE: 10.774509123648663 Optimal MAPE: 0.2217349708946194
The optimal Random Forest model isn't too much higher than our base but we can see that the measures all improved which shows the parameter tuning was a success. To better validate our model a 10-fold cross validation will also be performed.
Use sklearn cross_val_score method to validate the model with k-folds
cross_val_rf = cross_val_score(rf_opt, x_train, y_train, cv=10, n_jobs=16)
print("Optimal RF CV mean score: ", cross_val_rf.mean())
Optimal RF CV mean score: 0.8773030997341262
Our optimal Random Forest averages around 87.7% in the cross validation meaning we are not overfit and due to it being close to our other score we can say that this model does perform well for this dataset.
Use Matplotlib scatter method to create a scatter plot to visualize Actual vs Predicted Diameters
color = 1 - (y_test / pred_rf_opt)
fig, ax = plt.subplots()
plt.scatter(y_test, pred_rf_opt, c=color, cmap='plasma', vmin=-1, vmax=1, alpha=0.5)
# ax.scatter(y_test, pred)
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
ax.set_xlabel("Actual Diameter")
ax.set_ylabel("Predicted Diameter")
plt.title("Actual vs Predicted Asteroid Diameter using RF")
Text(0.5, 1.0, 'Actual vs Predicted Asteroid Diameter using RF')
This scatter plot shows that when comparing to the 100% R-Squared, the linear line, we do pretty well when predicting small diameters and start to slightly under/over predict the 150-300 range. Other than that there is only one large outlier that was over predicted but compared to the other models, we can say that the Random Forest Regressor performs well and could be used to predict an asteroids diameter.
| Model | R-Squared | MSE | MAPE | 10-Fold CV |
|---|---|---|---|---|
| KNN | 63.7% | 35.48 | 25.68% | 63.47% |
| GBR | 88.6% | 11.17 | 23.53% | 88.88% |
| SVR | 76.4% | 23.01 | 26.09% | 76.12% |
| RF | 88.9% | 10.77 | 22.17% | 87.73% |
Our two best performing models are the Gradient Boosting Regressor and Random Forest. Both achieved all of our goals for Regression and therefore would be useful for predicting an unknown asteroids diameter.
The feature Physically Hazardous Asteroid or pha only has two options, "Y" or "N". Therefore, using this target we needed to use classification to find the correct class to give each asteroid. To get the best possible results we used the following steps:
Importing all libraries for classification models
from sklearn.linear_model import LogisticRegression as lr
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, plot_confusion_matrix
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
import numpy as np
Spilt the data into training and testing sets using test_size = 0.2 and random_state = 1. This ensures that 80% of the dataset will be used for the training set, while the remaining 20% will be used the testing set.
Use sklearn train_test_split method to split the data into training and testing sets
xtrain, xtest, ytrain, ytest = train_test_split(x_classification, y_classification, stratify=y_classification, test_size=.3, random_state=1)
pca = PCA(n_components=2)
xtrain = pca.fit_transform(xtrain,ytrain)
xtest = pca.transform(xtest)
Logistic Regression is a simple and easy to use model so it was chosen as the baseline of all of our classification models. This supervised algorithm outputs the probabilities of samples belonging to a certain class.
Use sklearn to create logistic regression model as well as output classification report and confusion matrix.
lreg = lr()
lreg.fit(xtrain, ytrain)
pred_lreg = lreg.predict(xtest)
print("Base Logit: ", lreg.score(xtest,ytest))
cm = confusion_matrix(ytest, pred_lreg)
plt.figure()
plot_confusion_matrix(lreg, xtest, ytest)
plt.title("Logistic Regression Model - Confusion Matrix")
plt.xticks([0,1], ['No', 'Yes'])
plt.yticks([0,1], ['No', 'Yes'])
plt.show()
print(classification_report(pred_lreg, ytest))
Base Logit: 0.6776206322795341
/usr/local/lib/python3.7/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function plot_confusion_matrix is deprecated; Function `plot_confusion_matrix` is deprecated in 1.0 and will be removed in 1.2. Use one of the class methods: ConfusionMatrixDisplay.from_predictions or ConfusionMatrixDisplay.from_estimator. warnings.warn(msg, category=FutureWarning)
<Figure size 432x288 with 0 Axes>
precision recall f1-score support
0.0 0.65 0.69 0.67 2256
1.0 0.71 0.67 0.69 2552
accuracy 0.68 4808
macro avg 0.68 0.68 0.68 4808
weighted avg 0.68 0.68 0.68 4808
lreg_scores = cross_val_score(lreg, xtest, ytest, cv=5)
print('Average 10-fold score: %.3f' % np.mean(lreg_scores))
Average 10-fold score: 0.678
This model is a powerful supervised algorithm that maximizes the number of points within the boundaries with minimal error and maximal margin.
Use sklearn to create svc model and print the classification report and confusion matrix.
svc = SVC()
svc.fit(xtrain,ytrain)
pred_svc = svc.predict(xtest)
cm = confusion_matrix(ytest, pred_svc)
plt.figure()
plot_confusion_matrix(svc, xtest, ytest)
plt.title("SVC Model - Confusion Matrix")
plt.xticks([0,1], ['No', 'Yes'])
plt.yticks([0,1], ['No', 'Yes'])
plt.show()
print(classification_report(pred_svc, ytest))
/usr/local/lib/python3.7/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function plot_confusion_matrix is deprecated; Function `plot_confusion_matrix` is deprecated in 1.0 and will be removed in 1.2. Use one of the class methods: ConfusionMatrixDisplay.from_predictions or ConfusionMatrixDisplay.from_estimator. warnings.warn(msg, category=FutureWarning)
<Figure size 432x288 with 0 Axes>
precision recall f1-score support
0.0 0.66 0.67 0.66 2396
1.0 0.67 0.66 0.67 2412
accuracy 0.67 4808
macro avg 0.67 0.67 0.67 4808
weighted avg 0.67 0.67 0.67 4808
svc_scores = cross_val_score(svc, xtest, ytest, cv=5)
print('Average 10-fold score: %.3f' % np.mean(svc_scores))
Average 10-fold score: 0.657
Use higher regularization to help control our model capacity and hopefully attain better scores.
svc_opt = SVC(C=150)
svc_opt.fit(xtrain,ytrain)
pred_svc_opt = svc_opt.predict(xtest)
cm = confusion_matrix(ytest, pred_svc_opt)
plt.figure()
plot_confusion_matrix(svc_opt, xtest, ytest)
plt.title("SVC Model - Confusion Matrix")
plt.xticks([0,1], ['No', 'Yes'])
plt.yticks([0,1], ['No', 'Yes'])
plt.show()
print(classification_report(pred_svc_opt, ytest))
/usr/local/lib/python3.7/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function plot_confusion_matrix is deprecated; Function `plot_confusion_matrix` is deprecated in 1.0 and will be removed in 1.2. Use one of the class methods: ConfusionMatrixDisplay.from_predictions or ConfusionMatrixDisplay.from_estimator. warnings.warn(msg, category=FutureWarning)
<Figure size 432x288 with 0 Axes>
precision recall f1-score support
0.0 0.75 0.79 0.77 2269
1.0 0.80 0.76 0.78 2539
accuracy 0.77 4808
macro avg 0.77 0.78 0.77 4808
weighted avg 0.78 0.77 0.77 4808
svc_scores = cross_val_score(svc_opt, xtest, ytest, cv=5)
print('Average 10-fold score: %.3f' % np.mean(svc_scores))
Average 10-fold score: 0.753
This model actually performed well compared to our baseline at ~77%. It is still below our targets so other models are needed since this model may already be at a local optima.
Since Random Forest did exceptionally with regression, we decided to give it a chance in classification as well.
Use sklearn to create decision tree model as well as print classification report and confusion matrix.
dt = DecisionTreeClassifier()
dt.fit(xtrain,ytrain)
pred_dt = dt.predict(xtest)
cm = confusion_matrix(ytest, pred_dt)
plt.figure()
plot_confusion_matrix(dt, xtest, ytest)
plt.title("DT Model - Confusion Matrix")
plt.xticks([0,1], ['No', 'Yes'])
plt.yticks([0,1], ['No', 'Yes'])
plt.show()
print(classification_report(pred_dt, ytest))
/usr/local/lib/python3.7/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function plot_confusion_matrix is deprecated; Function `plot_confusion_matrix` is deprecated in 1.0 and will be removed in 1.2. Use one of the class methods: ConfusionMatrixDisplay.from_predictions or ConfusionMatrixDisplay.from_estimator. warnings.warn(msg, category=FutureWarning)
<Figure size 432x288 with 0 Axes>
precision recall f1-score support
0.0 0.79 0.78 0.78 2411
1.0 0.78 0.79 0.78 2397
accuracy 0.78 4808
macro avg 0.78 0.78 0.78 4808
weighted avg 0.78 0.78 0.78 4808
dt_scores = cross_val_score(dt, xtest, ytest, cv=10)
print('Average 10-fold score: %.3f' % np.mean(dt_scores))
Average 10-fold score: 0.776
This model's base performs as well as SVC's tuned model giving hope that adding some tuning will increase the scores of this model. Some smaller tests were ran since the model executed rather quickly. For this model we ended up with min samples per leaf being 25 and a max depth of 7.
dt_opt = DecisionTreeClassifier(min_samples_leaf=25, max_depth=7)
dt_opt.fit(xtrain,ytrain)
pred_dt_opt = dt_opt.predict(xtest)
cm = confusion_matrix(ytest, pred_dt_opt)
plt.figure()
plot_confusion_matrix(dt_opt, xtest, ytest)
plt.title("DT Model - Confusion Matrix")
plt.xticks([0,1], ['No', 'Yes'])
plt.yticks([0,1], ['No', 'Yes'])
plt.show()
print(classification_report(pred_dt_opt, ytest))
/usr/local/lib/python3.7/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function plot_confusion_matrix is deprecated; Function `plot_confusion_matrix` is deprecated in 1.0 and will be removed in 1.2. Use one of the class methods: ConfusionMatrixDisplay.from_predictions or ConfusionMatrixDisplay.from_estimator. warnings.warn(msg, category=FutureWarning)
<Figure size 432x288 with 0 Axes>
precision recall f1-score support
0.0 0.89 0.80 0.84 2680
1.0 0.78 0.88 0.82 2128
accuracy 0.83 4808
macro avg 0.83 0.84 0.83 4808
weighted avg 0.84 0.83 0.83 4808
dt_scores_opt = cross_val_score(dt_opt, xtest, ytest, cv=10)
print('Average 10-fold score: %.3f' % np.mean(dt_scores_opt))
Average 10-fold score: 0.820
Awesome! We hit almost all of our success targets. The precision for class 1 is slighly under but everything performs well overall and it can be said that this model can be used to predict if an unknown asteroid is physically hazardous or not.
| Sucess Measures: | CV Score | F1 | Precision | Recall |
|---|---|---|---|---|
| Logistic Regression | 67.8% | 68% | 68% | 68% |
| Support Vector Machine | 75.3% | 77% | 78% | 77% |
| Decision Tree | 82.0% | 83% | 84% | 83% |
Different Ideas for dealing with the imbalance include:
After attempting to use the raw data to upsample and downsample we concluded that we needed more usable data instead of resampling our available data and the VAE was developed to address the issue. To do a simple sanity check we compared the means and standard deviations for each feature to ensure the autoencoder was creating usable data that closely resembled the old data.
Grid Search: Every regression model used a grid search with 10-fold CV to do parameter tuning. From these we obtained our optimal models and printed our results.
Machine Learning approaches were able to predict the diameter of an asteroid and also determine if the asteroid is physically hazardous to the Earth. The diameter was predicted by using regression models. When analyzing the data, a large number of entries for diameter were null, which restricted the size of the data set. Of the four regression models used, the best performing models were Random Forests and Gradient Boosting. Both of these models surpassed the project goals with scores of 88.89% and 88.56% respectively. Additionally, the goals were met for predicting whether or not an asteroid is considered physically hazardous to Earth. A classification approach was used for this prediction, and initially scores were very strong due to a class imbalance, which was assessed using a Variational Auto Encoder. After this, the data size was limited and the best scoring model was Decision Trees with a cross validation score of 82.6%. These results are promising to both the usability of machine learning approaches and the prospect of building a system to predict the presence of a physically hazardous asteroid.
Machine Learning in Python (2021), Scikit-learn. Accessed: Oct. 7, 2021. [Online]. Available: https://scikit-learn.org/stable/
Matplotlib 3.5.0 (2021), Matplotlib. Accessed: Oct. 25, 2021. [Online]. Available: https://matplotlib.org/stable/index.html
NumPy (2021), NumPy. Accessed: Oct. 27, 2021. [Online]. Available: https://numpy.org/
Pandas (2021), Pandas. Accessed: Oct. 21, 2021. [Online]. Available: https://pandas.pydata.org/docs/reference/index.html
Seaborn 0.11.2 (2021), Seaborn. Accessed: Oct. 25, 2021. [Online]. Available: https://seaborn.pydata.org/
TensorFlow Core v2.7.0 (2021), TensorFlow. Accessed: Oct. 27, 2021. [Online]. Available: https://www.tensorflow.org/api_docs/python/tf
V. Basu. "Open Asteroid Dataset." (2019). https://www.kaggle.com/basu369victor/prediction-of-asteroid-diameter?select=Asteroid_Updated.csv
| Task | Colin | Leah | Jake | Ryan |
|---|---|---|---|---|
| Introduction | ✅ | ✅ | ||
| Problem Statement | ✅ | ✅ | ||
| Related Work | ✅ | ✅ | ||
| Data Management | ✅ | ✅ | ✅ | ✅ |
| Machine Learning Approaches | ✅ | ✅ | ✅ | ✅ |
| Regression Models | ✅ | ✅ | ✅ | ✅ |
KNN |
✅ | ✅ | ||
SVR |
✅ | |||
GBR |
✅ | ✅ | ||
RF |
✅ | ✅ | ||
| VAE | ✅ | |||
| Classification Models | ✅ | |||
Logistic Regression |
✅ | |||
SVC |
✅ | |||
DT |
✅ | |||
| Experiments | ✅ | ✅ | ✅ | ✅ |
| Conclusion | ✅ | ✅ | ||
| References | ✅ | ✅ | ||
| Total Contribution | 9 | 12 | 7 | 10 |